explore(agent-wiki): trajectory-derived wiki — skills, builder, experiments by vinodmut · Pull Request #268 · AgentToolkit/altk-evolve

vinodmut · 2026-06-10T05:54:54Z

Related to #256 — this is a prototype of offline trajectory-mining + consolidation ("dreaming"): reviewing saved trajectories to extract, consolidate, deduplicate, and curate memory outside the main task loop, with an auditable record of what changed.

What this is

An exploration in turning agent trajectories into a reusable, evidence-grounded wiki that future agents consult before acting — plus the experiments measuring whether it helps. Everything lives self-contained under explorations/agent-wiki/.

The core idea: after an agent finishes a task, distill its trajectory into wiki pages — episodic summaries, atomic guidelines, themed cluster pages, and executable skills — each linked back to the trajectory that produced it. A future agent, pointed at the wiki's AGENTS.md, retrieves the pages relevant to its task and applies them instead of re-deriving the recipe.
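As a rough illustration of the retrieval step described above, the sketch below models a wiki page as a record carrying a provenance back-link and ranks pages by tag overlap with the task. The `WikiPage` fields, the `retrieve` function, and the example slugs and tags are all hypothetical; the PR's actual on-disk format is specified in docs/schema.md and may differ.

```python
from dataclasses import dataclass


@dataclass
class WikiPage:
    """One wiki page. Field names are illustrative, not the PR's schema."""
    slug: str
    kind: str        # e.g. "summary", "guideline", "cluster", "skill"
    tags: set[str]
    source: str      # provenance back-link to the originating trajectory


def retrieve(pages: list[WikiPage], task_tags: set[str], limit: int = 3) -> list[WikiPage]:
    """Rank pages by tag overlap with the task and drop pages with no overlap."""
    scored = [(len(page.tags & task_tags), page) for page in pages]
    matches = [(score, page) for score, page in scored if score > 0]
    # Highest overlap first; ties broken alphabetically by slug.
    matches.sort(key=lambda pair: (-pair[0], pair[1].slug))
    return [page for _, page in matches[:limit]]


# Hypothetical pages; the source path uses the generic form from the PR.
pages = [
    WikiPage("parse-csv-header", "guideline", {"csv", "parsing"}, "trajectories/<session-id>.json"),
    WikiPage("convert-csv-to-json", "skill", {"csv", "json", "conversion"}, "trajectories/<session-id>.json"),
    WikiPage("read-png-metadata", "guideline", {"png"}, "trajectories/<session-id>.json"),
]
hits = retrieve(pages, {"csv", "json"})
```

Every returned page keeps its `source` field, so a consulting agent can trace a recommendation back to the trajectory that produced it.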

How this maps to #256 ("dreaming")

| #256 asks for | provided here |
| --- | --- |
| extract useful memories from raw trajectories after the fact | `agent-wiki-summarize` / `-extract-guidelines` / `-synthesize-skill` (retroactive + batch ingest) |
| consolidate duplicate / overlapping guidelines | `agent-wiki-consolidate-guidelines` → cluster pages |
| promote repeated observations; detect stale / redundant entities | delete-on-promote (`--archive-covered`), recall roll-up, priority tiers |
| auditable summary of what changed and why | `_audit.log` + provenance back-links on every page |
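The delete-on-promote row can be pictured with a minimal sketch: once a skill covers an atomic guideline, the guideline moves to an archive and an audit line records why. The function name, the data shapes, and the audit line format are assumptions made for illustration; they are not taken from the PR's builder.

```python
def archive_covered(
    guidelines: dict[str, str],
    skills: dict[str, set[str]],
) -> tuple[dict[str, str], dict[str, str], list[str]]:
    """Split guidelines into (active, archived) and return audit lines.

    guidelines: guideline id -> text
    skills:     skill slug -> ids of guidelines the skill already covers
    """
    active: dict[str, str] = {}
    archived: dict[str, str] = {}
    audit: list[str] = []
    for gid, text in guidelines.items():
        covering = sorted(slug for slug, ids in skills.items() if gid in ids)
        if covering:
            archived[gid] = text
            # Hypothetical audit format; the real _audit.log layout may differ.
            audit.append(f"archive {gid}: covered by skill {covering[0]}")
        else:
            active[gid] = text
    return active, archived, audit


guidelines = {"g1": "check the header row", "g2": "validate encoding", "g3": "quote fields"}
skills = {"convert-csv": {"g1", "g3"}}
active, archived, audit = archive_covered(guidelines, skills)
```

Archiving rather than deleting keeps the covered guidelines recoverable, which fits the auditability goal in the last table row.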

Layout

explorations/agent-wiki/
├── skills/        7 agent-wiki skills + build_agent_wiki.py (reference copy)
├── docs/          design.md (rationale) + schema.md (on-disk format)
├── experiments/   RESULTS-SUMMARY + comparison reports; metrics/ rollups; harness/ scripts
└── wikis/         worked examples: wiki-twobatch {base, skills, both, pruned}

Headline findings (`experiments/RESULTS-SUMMARY.md`)

Wiki vs no wiki: −20% cost, −38% duration, −43% tool calls at unchanged accuracy (16-task A/B).
Skills > guidelines: a skills-only wiki beats a guidelines-only one on cost (−14%) and matches accuracy.
Pointer wording is load-bearing: a strong-imperative CLAUDE.md pointer is read 3/3; a soft one 1/3.
Composition > size: piling guidelines on top of skills is the worst populated wiki; delete-on-promote (archive skill-covered atomics) beats it but skills-only stays cheapest.

Scope / data note

These are benchmark-derived example wikis (a synthetic 16-task file-format corpus). Raw per-trial sandbox transcripts and any wikis built from internal trajectory corpora are intentionally excluded — only metric rollups, narrative reports, and the benchmark-derived wikis are included. Source links in wiki frontmatter are shown in the generic trajectories/<session-id>.json form. The skills are a standalone reference copy, not wired into a plugin loader.

Summary by CodeRabbit

New Features
- Added an agent-wiki exploration with CLI experiment and analysis tooling to run, normalize, score, and compare wiki-consult experiments and render summary reports.
Documentation
- Large collection of design, schema, skill, and experiment write-ups describing wiki formats, ingestion/synthesis/consolidation workflows, experiment results, and usage guides.
Chores
- Excluded generated example wiki content from secret scanning and lint/type checks; updated secret-scan baseline and configs.

Adds explorations/agent-wiki/: the agent-wiki skill family, builder, design + schema docs, the wiki-helps experiment reports, and benchmark-derived example wikis, all under one tree suitable for a public PR.

Contents:
- skills/: 7 agent-wiki skills + build_agent_wiki.py (reference copy, not plugin-wired)
- docs/: design.md + schema.md
- experiments/: RESULTS-SUMMARY + twobatch comparison reports + pruned-index-hypothesis; metrics/ rollups (no raw transcripts); harness/ runner + compare scripts
- wikis/: wiki-terminalbench-bob + the twobatch arms (base / skills / both / pruned-corrected)

Public-safety scrub:
- Excluded all raw per-trial sandbox transcripts (kept only metric rollups + narrative reports).
- Excluded wikis built from internal corpora (procedural-design, consult-meta, iterative, retroactive, simple-claude, test-paired, claude) and the build-pattern comparison that ran on them; §3-4 of RESULTS-SUMMARY reduced to a portable-finding note.
- Rewrote all source-path frontmatter to the generic trajectories/<session-id>.json form; genericized internal example names and the benchmark-data dir convention in skills/docs.
- Leak gate (benchmark-data / internal corpus + wiki names / org paths) passes with zero hits across the tree.

Branched off main; diff touches only explorations/agent-wiki/. Builder catalog + comparison scripts verified runnable from the new location.

Removes the terminal-bench example wiki from the exploration. Repoints the README reading-order + layout to wiki-twobatch-skills, fixes the docs that attributed worked examples to it (schema.md now points at the wiki-twobatch arms; example index rows retagged), and corrects stale relative links the docs carried from the original tree (../plugin-source → ../skills, ../WIKIS.md removed, ../experiments/wiki-build-comparison.md → RESULTS-SUMMARY §3–4, design.md/schema.md cross-links to renamed filenames). Skill example paths (consult, ingest) repointed off the removed wiki. Remaining wikis: wiki-twobatch {base, skills, both, pruned}. All intra-doc relative links resolve; leak gate clean.

coderabbitai · 2026-06-10T05:55:03Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c6892d64-bc1e-48c5-a3bf-2141cc315c86

📥 Commits

Reviewing files that changed from the base of the PR and between 26c2884 and 7a1f5ac.

📒 Files selected for processing (3)

.pre-commit-config.yaml
.secrets.baseline
explorations/agent-wiki/experiments/harness/experiment_wiki_consult.py

📝 Walkthrough

Adds an agent-wiki exploration: README, design and schema docs, skills/templates and default configs, a Dockerized experiment harness and 17-task suite, transcript normalization and metric-extraction tooling, multi-arm comparison scripts and reports, populated JSONL metric datasets, and repo config updates to exclude generated wikis from scans and linters.

Changes

Agent-Wiki Framework Design & Schema

| Layer / File(s) | Summary |
| --- | --- |
| Framework README `explorations/agent-wiki/README.md` | Repository-level README introducing the agent-wiki concept, layout, reading order, and scope constraints. |
| Design principles & pipeline `explorations/agent-wiki/docs/design.md` | Design doc specifying provenance, page kinds, lifecycle rules (consolidate/delete-on-promote), pipeline ordering, build patterns, and experimental evidence summary. |
| On-disk schema & contracts `explorations/agent-wiki/docs/schema.md` | Schema reference for page kinds, YAML frontmatter, index/config/audit artifacts, linking rules, promotion/archival mechanics, and worked examples. |

Experimental Validation & Result Analysis

| Layer / File(s) | Summary |
| --- | --- |
| Experiment harness & task suite `explorations/agent-wiki/experiments/harness/experiment_wiki_consult.py`, `explorations/agent-wiki/experiments/harness/wiki_consult_tasks.yaml` | CLI harness that creates per-trial workspaces, runs claude-sandbox sessions with condition-specific setup, parses stream-json to detect AGENTS.md/guideline reads and assistant text, scores signals, and writes runs/transcripts/summary; includes 17 prompt-driven tasks. |
| Transcript normalization & metrics extraction `explorations/agent-wiki/experiments/harness/normalize_stream_json_transcripts.py`, `explorations/agent-wiki/experiments/harness/extract_trial_metrics.py` | Normalize stream-json transcripts to an OpenAI-chat format and extract per-trial metrics (token counts, tool calls, wiki/index/guideline reads, durations, costs, outcome matching). |
| Comparison & reporting tools `explorations/agent-wiki/experiments/harness/twobatch_compare.py`, `.../threeway_compare.py`, `.../fourway_compare.py`, `.../fiveway_compare.py` | Scripts that load JSONL metric files, group by task/arm, compute medians/accuracies/deltas, and render Markdown comparison reports (aggregate, per-family, per-task). |
| Experiment reports & metrics `explorations/agent-wiki/experiments/RESULTS-SUMMARY.md`, `explorations/agent-wiki/experiments/-comparison.md`, `explorations/agent-wiki/experiments/metrics/.metrics.jsonl`, `explorations/agent-wiki/experiments/pruned-index-hypothesis.md` | Comprehensive experiment writeups and JSONL metric datasets (48–95 records per file) used for analysis and comparisons. |
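The comparison step summarized in the table (load JSONL metrics, group by arm, take medians, report deltas) can be sketched as follows. This is a minimal illustration under assumed field names (`arm`, `cost_usd`) and invented sample values; it does not reproduce the PR's compare scripts or its measured numbers.

```python
import json
from collections import defaultdict
from statistics import median


def median_by_arm(jsonl_text: str, metric: str) -> dict[str, float]:
    """Group JSONL records by their 'arm' field and take the median of one metric."""
    by_arm: dict[str, list[float]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            by_arm[record["arm"]].append(record[metric])
    return {arm: median(values) for arm, values in by_arm.items()}


def pct_delta(baseline: float, treatment: float) -> float:
    """Relative change of treatment versus baseline, in percent."""
    return 100.0 * (treatment - baseline) / baseline


# Made-up records for illustration only; not data from the PR.
sample = "\n".join(json.dumps(r) for r in [
    {"task": "t1", "arm": "no-wiki", "cost_usd": 0.50},
    {"task": "t2", "arm": "no-wiki", "cost_usd": 0.30},
    {"task": "t1", "arm": "wiki", "cost_usd": 0.35},
    {"task": "t2", "arm": "wiki", "cost_usd": 0.25},
])
medians = median_by_arm(sample, "cost_usd")
delta = pct_delta(medians["no-wiki"], medians["wiki"])
```

Medians are less sensitive than means to a single runaway trial, which matters when each arm has only a handful of runs per task.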

Operational Skills & Templates

| Layer / File(s) | Summary |
| --- | --- |
| AGENTS.md template & default config `explorations/agent-wiki/skills/scripts/_default_agents.md`, `explorations/agent-wiki/skills/scripts/_default_agent_wiki_config.yaml` | AGENTS.md template and default YAML config describing consult contract, file layouts, tags/clusters/tasks, and examples. |
| Consult skill & retrieval contract `explorations/agent-wiki/skills/agent-wiki-consult/SKILL.md` | Consult-skill docs describing wiki root resolution, reading `AGENTS.md` and `_index.jsonl`, applying retrieval recipes, and surfacing ranked candidate matches. |
| Summarize / Extract / Synthesize / Ingest `explorations/agent-wiki/skills/*/SKILL.md` | Skills documentation covering per-trace summarization, guideline extraction (entities JSON schema), skill synthesis (skill JSON schema and render behavior), consolidation, task comparisons, and full ingest orchestration with step ordering and best-practices. |

Repository Tooling & Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Repo config updates `.pre-commit-config.yaml`, `.secrets.baseline`, `pyproject.toml` | Exclude `explorations/agent-wiki/` from detect-secrets scanning (with comments), update `.secrets.baseline` exclude.files and a recorded sandbox README entry, and extend Ruff/MyPy excludes for generated wikis in `pyproject.toml`. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

visahak
illeatmyhat
gaodan-fang

Poem

🐰 I nibble through traces, stitch pages with care,
Little rules and skills bloom from trials laid bare.
From runs and metrics, a tidy guide I spin—
One rabbit's hops turn many agents' win. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
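For context on how a docstring-coverage figure like the one above can be computed, here is a small sketch using Python's standard `ast` module. It counts functions, async functions, and classes and reports the share that carry a docstring. Which node kinds CodeRabbit actually counts is not stated in this thread, so the counting rule here is an assumption.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Percentage of functions and classes in `source` that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


sample = '''
def documented():
    """Has a docstring."""
    return 1

def bare():
    return 2

class Thing:
    def method(self):
        return 3
'''
coverage = docstring_coverage(sample)
```

In the sample, one of four definitions is documented, so the function reports 25%.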

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'explore(agent-wiki): trajectory-derived wiki — skills, builder, experiments' directly and comprehensively summarizes the main addition: an exploration of an agent-wiki system with skills, builder, and experiments. It is clear, specific, and accurately reflects the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

CI (ruff, mypy, detect-secrets) was scanning explorations/agent-wiki/ as project source, because it is the first content under explorations/ to carry .py files and high-entropy identifiers. Fixes, scoped so generated example artifacts are treated like the already-excluded plugin-source/ and examples/ trees:
- ruff: lint + format fixes in the harness scripts + builder; exclude the generated wiki scripts (explorations/agent-wiki/wikis/) via extend-exclude.
- mypy: add explorations/agent-wiki/wikis/ to exclude; add file-local `# mypy: ignore-errors` to the exploration harness + the builder (a verbatim copy of the mypy-excluded plugin-source/ original).
- detect-secrets: exclude explorations/agent-wiki/ in the pre-commit hook and .secrets.baseline; the 53 findings are 12-hex guideline content hashes and session-id UUIDs, not secrets.

No example-wiki content changed (scripts keep their original names). Fixes failing CI checks: check-formatting, check-linting, check-typing, tekton/pr-code-checks/code-detect-secrets.
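One way to sanity-check a claim like "the findings are content hashes and UUIDs, not secrets" is to classify each flagged token by shape. The sketch below does that with two regular expressions. The `classify` helper and the sample tokens are hypothetical and are not part of the PR or of detect-secrets itself.

```python
import re

# A 12-character lowercase hex string, the shape described for guideline content hashes.
HEX12 = re.compile(r"[0-9a-f]{12}")
# Canonical hyphenated UUID, the shape described for session ids.
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def classify(token: str) -> str:
    """Label a flagged token as 'content-hash', 'uuid', or 'other'."""
    if HEX12.fullmatch(token):
        return "content-hash"
    if UUID_RE.fullmatch(token.lower()):
        return "uuid"
    return "other"  # needs a human look before it is excluded


# Invented example tokens, not findings from the PR.
flagged = ["a1b2c3d4e5f6", "123e4567-e89b-12d3-a456-426614174000", "hunter2-password"]
labels = [classify(token) for token in flagged]
```

Anything labelled `other` would still warrant manual review before a path-wide exclude is accepted.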

Drops explorations/agent-wiki/wikis/ (253 generated files, ~10k lines) from this PR so the diff is the reviewable surface: skills, builder, docs, and the experiment reports/harness (~34 files). The example wikis are machine-generated output; bundling them buried the code and appears to have made CodeRabbit skip deep review (summary only, zero inline findings). The wikis land in a stacked follow-up PR. README/docs still reference wikis/wiki-twobatch-* by path; those links resolve once the follow-up merges. Root-config excludes (ruff/mypy/detect-secrets) are kept: the detect-secrets exclude still covers example content hashes in docs/schema.md, and the wiki excludes become live again when the follow-up lands.

vinodmut · 2026-06-10T07:07:45Z

Split the generated example wikis into a companion PR #269 (merge after this one) so this diff stays focused on the reviewable code — builder, skills, docs, and experiment harness (34 files vs the original 287). This should let CodeRabbit review the code properly.

vinodmut · 2026-06-10T07:09:19Z

@coderabbitai review

coderabbitai · 2026-06-10T07:09:27Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (13)

explorations/agent-wiki/skills/agent-wiki-synthesize-skill/SKILL.md (1)

205-213: ⚡ Quick win

Add language specifier to fenced code block.

The directory structure example should use text or similar language identifier for consistency.

📝 Suggested fix

-```
+```text
 <wiki>/skills/
 ├── _id_index.json                     skill slug → relpath
 ├── index.md                           alphabetical listing (auto-generated)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@explorations/agent-wiki/skills/agent-wiki-synthesize-skill/SKILL.md` around
lines 205 - 213, Update the fenced code block in SKILL.md that shows the
directory tree for "<wiki>/skills/" to include a language specifier (e.g.,
change the opening ``` to ```text) so the block is marked as plain text; locate
the block in the SKILL.md content that begins with the three backticks followed
by the tree and replace the opening fence accordingly to ensure consistent
formatting.

Source: Linters/SAST tools
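A check like the one behind this nitpick can be approximated in a few lines: walk the Markdown, track whether the current line is inside a fence, and report opening fences that carry no language. This is a simplified sketch that assumes only backtick fences and no nesting; it is not the linter CodeRabbit ran.

```python
def bare_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening code fences that lack a language."""
    hits: list[int] = []
    inside = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("`" * 3):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True   # this fence opens a new block
            if stripped == "`" * 3:
                hits.append(lineno)
    return hits


fence = "`" * 3  # built at runtime so the example can sit inside a fenced block
sample = "\n".join([
    fence, "<wiki>/skills/", fence,   # opening fence on line 1 has no language
    "",
    fence + "bash", "ls", fence,      # opening fence on line 5 names a language
])
found = bare_fences(sample)
```

Tracking the open/closed state is what keeps closing fences, which never carry a language, from being reported.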

explorations/agent-wiki/skills/agent-wiki-consult/SKILL.md (2)

53-55: ⚡ Quick win

Add language specifier to fenced code block.

The code block should specify bash as the language for proper syntax highlighting and consistency with the rest of the documentation.

📝 Suggested fix

-```
+```bash
 Read <wiki-root>/AGENTS.md

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>